{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Scaling and binning data\n", "\n", "Data comes in all shapes and forms, but sometimes it's essential to get data into the same range of values. Sometimes scaling the data is not enough, but grouping data into similar categories is beneficial for analysis.\n", "\n", "Data comes in various shapes and forms, including numerical, categorical, and textual data. Analyzing different types of data often requires extra preprocessing steps. \n", "\n", "In some cases, data needs to be standardized to a standard range of values, such as scaling numerical data. However, scaling alone may not be enough, and grouping data into similar categories can be more beneficial for analysis. Grouping data can also simplify complex datasets and improve the accuracy of models.\n", "\n", "This notebook will discuss different techniques for grouping data, including binning, one-hot encoding, and clustering. We will also explore the benefits and limitations of each technique and how to choose the most appropriate method based on the data and analysis goals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"data/housing.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximity
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Min-Max Scaling\n", "\n", "$x' = \\frac{x - \\text{min}(x)}{\\text{max}(x)-\\text{min}(x)}$" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximityhouse_value_minmax
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY0.902266
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY0.708247
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY0.695051
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY0.672783
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY0.674638
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \\\n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY \n", "\n", " house_value_minmax \n", "0 0.902266 \n", "1 0.708247 \n", "2 0.695051 \n", "3 0.672783 \n", "4 0.674638 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"house_value_minmax\"] = (df.median_house_value-df.median_house_value.min()) / (df.median_house_value.max()-df.median_house_value.min())\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Z-Score\n", "$x' = \\frac{x - \\bar{x}}{\\sigma}$" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximityhouse_value_minmaxhouse_value_z
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY0.9022662.129580
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY0.7082471.314124
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY0.6950511.258663
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY0.6727831.165072
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY0.6746381.172871
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \\\n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY \n", "\n", " house_value_minmax house_value_z \n", "0 0.902266 2.129580 \n", "1 0.708247 1.314124 \n", "2 0.695051 1.258663 \n", "3 0.672783 1.165072 \n", "4 0.674638 1.172871 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"house_value_z\"] = (df.median_house_value - df.median_house_value.mean()) / df.median_house_value.std()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binning" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximityhouse_value_minmaxhouse_value_zprice_range
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY0.9022662.129580luxury
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY0.7082471.314124high
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY0.6950511.258663high
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY0.6727831.165072high
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY0.6746381.172871high
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \\\n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY \n", "\n", " house_value_minmax house_value_z price_range \n", "0 0.902266 2.129580 luxury \n", "1 0.708247 1.314124 high \n", "2 0.695051 1.258663 high \n", "3 0.672783 1.165072 high \n", "4 0.674638 1.172871 high " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bin_names = [\"cheap\", \"medium\", \"high\", \"luxury\"]\n", "df[\"price_range\"] = pd.cut(df.house_value_minmax, \n", " [0, .25, .5, .75, 1],\n", " labels=bin_names)\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximityhouse_value_minmaxhouse_value_zprice_rangeprice_range_quantile
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY0.9022662.129580luxuryluxury
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY0.7082471.314124highluxury
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY0.6950511.258663highluxury
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY0.6727831.165072highluxury
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY0.6746381.172871highluxury
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \\\n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY \n", "\n", " house_value_minmax house_value_z price_range price_range_quantile \n", "0 0.902266 2.129580 luxury luxury \n", "1 0.708247 1.314124 high luxury \n", "2 0.695051 1.258663 high luxury \n", "3 0.672783 1.165072 high luxury \n", "4 0.674638 1.172871 high luxury " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"price_range_quantile\"] = pd.qcut(df.house_value_minmax, \n", " 4, labels = bin_names)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[luxury, high, medium, cheap]\n", "Categories (4, object): [cheap < medium < high < luxury]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.price_range_quantile.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Scaling" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "out = preprocessing.MinMaxScaler().fit_transform(df[[\"median_house_value\"]])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "4 False\n", " ... \n", "20635 False\n", "20636 False\n", "20637 True\n", "20638 True\n", "20639 True\n", "Name: house_value_minmax, Length: 20640, dtype: bool" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.squeeze(out) == df.house_value_minmax" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.90226638] 0.9022663824066705\n" ] } ], "source": [ "print(out[0], df.house_value_minmax[0])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(np.squeeze(out), df.house_value_minmax)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 2.12963148],\n", " [ 1.31415614],\n", " [ 1.25869341],\n", " ...,\n", " [-0.99274649],\n", " [-1.05860847],\n", " [-1.01787803]])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessing.StandardScaler().fit_transform(df[[\"median_house_value\"]])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1.88044789],\n", " [ 1.23204134],\n", " [ 1.18794143],\n", " ...,\n", " [-0.60223945],\n", " [-0.6546081 ],\n", " [-0.62222222]])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessing.RobustScaler().fit_transform(df[[\"median_house_value\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Familiarize yourself with the preprocessing library in scikit-learn." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Wikipedia - Feature Scaling](https://en.wikipedia.org/wiki/Feature_scaling)\n", "- [Scikit-Learn Scaling](https://scikit-learn.org/stable/modules/preprocessing.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }